Abstract

In this paper, we present a dataset generated from 22
previously run and currently running randomized
controlled experiments inside the
ASSISTments online learning platform. This dataset
provides data mining opportunities for researchers to 
analyze ASSISTments data in a convenient format 
across multiple experiments at the same time. The 
data preprocessing steps are explained in detail to 
inform researchers about how this dataset was 
generated. A list of column descriptions is provided to
define the columns in the dataset, and a set of summary
statistics is presented to briefly describe the dataset.
Introduction

ASSISTments 

ASSISTments is a freely available online learning
platform used mostly in mathematics classes in
grades 4-12 in the United States. ASSISTments
currently has 140,032 users, who completed
32,625,908 problems between 2013 and 2015. Most of
the users are middle school (grades 6-8) mathematics
students located in or near Massachusetts.
Students use ASSISTments on classwork and 
homework, which may be done with or without the use 
of a paper copy. Instant feedback is typically provided 
to the students upon answering problems, so that 
students will immediately know if they have answered 
the problem correctly. Students are not allowed to skip 
any problems and must answer correctly in order to 
continue on to the next problem. If a student does not 
answer the problem correctly on their first attempt, the 
problem is marked as incorrect. Almost all problems in 
ASSISTments give the student the option to click on a 
hint button, which either offers additional help to the 
student or just gives the student the answer to the 
problem, so that he or she may continue. 

Experiment Infrastructure 

All of the experiments were created by either internal 
or external researchers working with ASSISTments. 

Dr. Heffernan was funded by the NSF to create
ASSISTmentsTestBed.org to support external
researchers in conducting studies.
ASSISTmentsTestBed.org is a reporting service that
helps researchers retrieve data on experiments that
have been run inside the ASSISTments system [1, 2].
This reporting service is also known as the Assessment
of Learning Infrastructure (ALI), which combines
automated analysis with data reporting [6].

Several of these experiments have been previously
published, with topics including student confidence,
student choice, and buggy messages [3-5, 8].


Data Selection 

In the ASSISTments online learning platform, there are
several ways to build assignment structures in which
experiments can be embedded. One type of
structure is called a Skill Builder. A Skill Builder is an
assignment type that consists of a large number of
similar problems, where students must answer a
specified number of problems (usually three) correctly
in a row on the same day in order to finish the
assignment. These problems are generated from
variabilized templates, where several problem instances
are generated from the same template with different
numbers replacing the variables of the template [7].
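
The completion rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ASSISTments implementation: the function name `problems_to_complete`, the `(day, correct)` input format, and the default streak of three are all hypothetical choices made for the example.

```python
from typing import Iterable, Optional, Tuple


def problems_to_complete(
    attempts: Iterable[Tuple[str, bool]], streak_needed: int = 3
) -> Optional[int]:
    """Return the number of problems attempted up to completion, or None.

    `attempts` is a chronological sequence of (day, first_attempt_correct)
    pairs. This input format is an assumption made for illustration.
    """
    streak = 0
    current_day = None
    for count, (day, correct) in enumerate(attempts, start=1):
        if day != current_day:
            # The correct answers must occur on the same day,
            # so a new day restarts the streak.
            current_day = day
            streak = 0
        streak = streak + 1 if correct else 0
        if streak == streak_needed:
            return count
    return None  # the student never reached the required streak
```

For example, a student who misses the first problem and then answers three in a row correctly on the same day finishes after four problems, while a streak that is split across two days does not count.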

We chose to include only experiments that leveraged
the Skill Builder infrastructure because it was the
most common assignment structure. We also chose to
use only experiments with a control group and a single
experimental group, since that was the study design for
most of the experiments. Using only Skill Builder
assignments with a single experimental condition allows
a common set of independent and dependent
variables to be used for all 22 experiments.

To ensure that there was a large enough sample of
students for each experiment, only experiments in which
more than fifty students completed the assignment in all
the conditions of the experiment were chosen. Also,
only experiments that had less than 10% corrupted
data were used.
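
The selection criteria above can be expressed as a simple filter. The sketch below is illustrative only: the `ExperimentSummary` record and its field names are hypothetical, and it assumes that the fifty-student threshold applies to each condition separately, which is one possible reading of the criterion.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExperimentSummary:
    """Hypothetical per-experiment summary used only for this example."""
    experiment_id: int
    completed_by_condition: Dict[str, int]  # students who completed, per condition
    corrupted_fraction: float               # share of the data judged corrupted


def select_experiments(
    summaries: List[ExperimentSummary],
    min_completed: int = 50,
    max_corrupted: float = 0.10,
) -> List[int]:
    """Return the ids of experiments that satisfy all selection criteria."""
    selected = []
    for s in summaries:
        # One control group and a single experimental group.
        two_conditions = len(s.completed_by_condition) == 2
        # More than `min_completed` completers in every condition (assumed reading).
        enough_students = all(
            n > min_completed for n in s.completed_by_condition.values()
        )
        # Strictly less than 10% corrupted data.
        clean_enough = s.corrupted_fraction < max_corrupted
        if two_conditions and enough_students and clean_enough:
            selected.append(s.experiment_id)
    return selected
```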

Data Pre-Processing 

The following steps were taken to clean the data and
transform all the experiment data into a canonical form
for analysis. First, the original data was retrieved for
all the assignments that were known to be experiments
by using ASSISTmentsTestBed.org. Once this data
was retrieved, problems which were not common to all
the experiments were removed. These problems
included any pretest and posttest problems that an
experiment may have contained, as well as scaffold
problems. Most of the prior experiments did not
have pretest or posttest problems, since the ability to
create such experiments did not exist in the
ASSISTments system at the time those experiments
were created. Scaffold problems are sub-problems for
a given original problem, which break the original
problem down into a few smaller sub-problems.

Experiment Number    Students    Problems
 1                      1,204       7,657
 2                        695       9,466
 3                        381       5,105
 4                        627       9,771
 5                      1,864      12,549
 6                        457       1,861
 7                        515       2,068
 8                        540       2,212
 9                        402       2,213
10                        545       8,030
11                        412       2,612
12                        335       1,817
13                        497       3,930
14                        704       3,197
15                        523       2,967
16                        136       1,526
17                        428       2,930
18                        745       3,776
19                      1,330       4,694
20                      1,917       9,268
21                        429       2,538
22                        426       2,065

Table 1. Number of students and problems done for each experiment.

Scaffold problems were removed because only eleven
experiments contained them and because scaffold
problems are considered a type of tutoring that does
not count toward the correctness of the Skill Builder
assignment.
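
A minimal sketch of this filtering step is shown below. It assumes each problem log row carries a `problem_type` label with the values `"pretest"`, `"posttest"`, and `"scaffold"`; that field and its values are hypothetical and may not match the column names in the actual dataset.

```python
from typing import Dict, List

# Problem types that are not common to all experiments (assumed labels).
EXCLUDED_TYPES = {"pretest", "posttest", "scaffold"}


def keep_common_problems(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop pretest, posttest, and scaffold rows; keep everything else."""
    return [row for row in rows if row["problem_type"] not in EXCLUDED_TYPES]
```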

A recent improvement to the ASSISTments system was
the addition of logging of the condition to which
students were randomly assigned. Unfortunately, this
data did not exist at the time most of the experiments
were run. Therefore, the condition to which each
student was randomly assigned was derived from the
distinguishing problem numbers that the student had
worked on.
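
One way to picture this derivation is a lookup from distinguishing problem numbers to condition labels. The sketch below is an assumption about how such a step could look, not the procedure actually used; the function name, the mapping format, and the decision to return `None` for ambiguous or unmatched students are all hypothetical.

```python
from typing import Dict, Iterable, Optional, Set


def infer_condition(
    problems_seen: Iterable[int], distinguishing: Dict[str, Set[int]]
) -> Optional[str]:
    """Infer a student's condition from the problems they worked on.

    `distinguishing` maps each condition label to the problem numbers
    that appear only in that condition (hypothetical format).
    """
    seen = set(problems_seen)
    matches = [cond for cond, ids in distinguishing.items() if seen & ids]
    # Return a label only when exactly one condition matches; students
    # with no match or with conflicting matches are left unlabeled.
    return matches[0] if len(matches) == 1 else None
```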

Several experiments involved interventions containing
video feedback, where students had to be able to view
video in order to be considered part of the experiment.
These experiments had an initial question to check
whether or not the student could see the video. The
answers to these questions were analyzed to determine
if the student was able to see video. An additional
column was added to the dataset to indicate if the
student could see video where applicable.


The two variables used for measuring student
performance were whether or not the student
completed the assignment and the mastery speed for
those students who were able to complete the
assignment. The number of problems it takes a
student to complete a Skill Builder is called mastery
speed, as defined in [10]. We believe these two
variables are important measures: mastery
speed measures student learning, and completion rate
measures student persistence. There were a few cases
where a small number of students did a large number
of problems, which added skew to the distribution of
mastery speed. A log transform was used so that these
few students did not have a large impact on the means
for mastery speed.
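
The effect of the log transform on the mean can be illustrated with a small example. The mastery speeds below are made up for illustration, and the use of the natural logarithm is an assumption, since the base of the transform is not specified here.

```python
import math
from statistics import mean
from typing import List, Sequence


def log_transform(speeds: Sequence[int]) -> List[float]:
    """Natural log of each mastery speed (the base is an assumption)."""
    return [math.log(s) for s in speeds]


# Hypothetical mastery speeds: four typical students and one outlier.
speeds = [3, 3, 4, 5, 60]

raw_mean = mean(speeds)                 # pulled upward by the outlier
log_mean = mean(log_transform(speeds))  # mean on the log scale
back_transformed = math.exp(log_mean)   # geometric mean, in problems

print(f"raw mean: {raw_mean:.2f}")
print(f"back-transformed log mean: {back_transformed:.2f}")
```

In this made-up sample the single outlier dominates the raw mean, while the mean computed on the log scale stays close to the typical student's mastery speed.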

Data Description 

Descriptions of the column names in the dataset are
provided at the ASSISTments Data Dump Glossary
webpage (see footnote 1). The dataset has 32 columns,
including a number of student covariates. The dataset
is in the form of one row per student per assignment.
Table 1 shows some basic summary statistics of the
dataset, including the number of problems done for each
experiment and the number of unique students in each
experiment. A total of 102,252 problems were
attempted by 8,297 unique students across 22 different
experiments. Note that a single student could have
participated in more than one experiment and not all
students experienced the conditions of the experiment.
This dataset can be downloaded at
https://sites.google.com/site/las2016data/data/ThisOne.xlsx.
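
As a consistency check, the per-experiment counts in Table 1 can be summed and compared with the totals reported above. The short script below copies the values from Table 1; the variable names are arbitrary.

```python
# (students, problems) per experiment, copied from Table 1.
TABLE_1 = {
    1: (1204, 7657),   2: (695, 9466),   3: (381, 5105),
    4: (627, 9771),    5: (1864, 12549), 6: (457, 1861),
    7: (515, 2068),    8: (540, 2212),   9: (402, 2213),
    10: (545, 8030),   11: (412, 2612),  12: (335, 1817),
    13: (497, 3930),   14: (704, 3197),  15: (523, 2967),
    16: (136, 1526),   17: (428, 2930),  18: (745, 3776),
    19: (1330, 4694),  20: (1917, 9268), 21: (429, 2538),
    22: (426, 2065),
}

UNIQUE_STUDENTS = 8297  # reported total of unique students

total_problems = sum(problems for _, problems in TABLE_1.values())
summed_students = sum(students for students, _ in TABLE_1.values())

print(total_problems)                     # 102252, matching the reported total
print(summed_students)                    # 15112 student-experiment pairs
print(summed_students - UNIQUE_STUDENTS)  # 6815
```

The problem counts sum to the reported 102,252. The per-experiment student counts sum to 15,112, which exceeds the 8,297 unique students because a single student can appear in more than one experiment.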


1 http://www.assistmentstestbed.org/the-data/interpreting-your-data-v-1-0




Contributions, Future Work, and Conclusions

The contribution of this paper is a dataset
consisting of 22 randomized
controlled experiments run inside the ASSISTments
online learning platform. What makes this dataset
unique is that it is a combination of data from several 
different experiments all with a common design and 
common set of independent and dependent variables. 
We believe this dataset can be used for further analysis 
with a focus on finding general trends among 
independent variables across multiple experiments. It 
is currently being used in research to improve the 
detection of treatment effects in randomized controlled 
experiments [9]. 